feat(seqopt): pure-Python EA operators + DEAP parity + SeqOptPlot (protein engineering)#271
Merged
Pure-Python (no runtime dep), closing the gaps from the NSGA-II-only first cut:
- variation varAnd/varOr; survival mu_plus_lambda/mu_comma_lambda/ea_simple
- constraints (feasibility callables) with DeltaPenalty / ClosestValidPenalty
- single-objective Hall of Fame (SeqOpt.hall_of_fame_) beside the Pareto archive
- convergence metric (generational distance to a ref_front) in eval
- engine='exact'|'fast' (numpy-vectorized non-dominated sort; numerically identical front, faster); crowding now uses DEAP's nobj*span normalization

DEAP parity (dev/test-only oracle; runtime stays DEAP-free):
- deap added to [dev]; test_seqopt_deap_parity.py asserts our sort/crowding/selNSGA2 reproduce DEAP's sortNondominated/assignCrowdingDist/selNSGA2 on synthetic fitness (identical rank incl. ties, crowding values+ordering within atol, selNSGA2 profile)
- Phase-C comparison (.github/scripts/seqopt_deap_comparison.py): ours-exact/fast vs DEAP, correctness + wall-clock + peak memory -> ship-ours (fast beats DEAP, e.g. ~14ms vs ~102ms at 500x3, dependency-free)

Docs: ADR-XXXX (number-last), CONTEXT.md EA-operator/engine/convergence terms, release note.

85 SeqOpt tests + 447 in the broader gate green; docstrings/param coverage clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
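The convergence metric named above (generational distance to a ref_front) can be illustrated with a minimal sketch. It assumes the common "mean nearest-neighbor Euclidean distance" convention; the commit does not say which variant SeqOpt's eval uses, and the function name here is hypothetical.

```python
import math
from typing import Sequence

Point = Sequence[float]


def generational_distance(front: Sequence[Point], ref_front: Sequence[Point]) -> float:
    """Mean Euclidean distance from each front point to its nearest ref_front point.

    Assumption: this averaging convention is one of several in the literature;
    it is not confirmed to be the one used by SeqOpt.
    """
    if not front or not ref_front:
        raise ValueError("front and ref_front must be non-empty")
    total = 0.0
    for p in front:
        # Distance to the closest reference point.
        total += min(math.dist(p, r) for r in ref_front)
    return total / len(front)


# Toy 2-objective reference front (illustrative values only).
ref_front = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
gd_on_front = generational_distance(ref_front, ref_front)
gd_shifted = generational_distance([(0.0, 1.5), (1.0, 0.5)], ref_front)
```

A front that lies on the reference scores zero; the further the obtained front sits from the reference, the larger the value.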
Pin what is actually invariant vs DEAP: non-dominated rank always identical (incl. ties); crowding values+ordering and selNSGA2 survivor profile identical on continuous fitness (within 1e-9); survivor rank-distribution identical under heavy ties (the exact tied-individual kept is arbitrary in DEAP too). Drops the over-strict exact-set/profile-under-duplicates claims that don't hold (boundary points tie at inf crowding even for continuous objectives).
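Why boundary points tie at inf crowding can be seen in a small pure-Python sketch of NSGA-II crowding distance with the nobj*span normalization mentioned in this PR. This is an illustration written for this note, not the SeqOpt or DEAP source; the function name is hypothetical.

```python
from typing import List, Sequence


def crowding_distance(fitness: Sequence[Sequence[float]]) -> List[float]:
    """Crowding distance per individual for one front (illustrative sketch)."""
    n = len(fitness)
    if n == 0:
        return []
    nobj = len(fitness[0])
    dist = [0.0] * n
    for m in range(nobj):
        # Indices sorted by the m-th objective value.
        order = sorted(range(n), key=lambda i: fitness[i][m])
        # The two extremes of every objective get infinite distance,
        # so they always tie with each other regardless of the values.
        dist[order[0]] = dist[order[-1]] = float("inf")
        span = fitness[order[-1]][m] - fitness[order[0]][m]
        if span == 0:
            continue  # all values equal: this objective contributes nothing
        norm = nobj * span
        for prev, cur, nxt in zip(order, order[1:], order[2:]):
            dist[cur] += (fitness[nxt][m] - fitness[prev][m]) / norm
    return dist


# Four points on a continuous 2-objective front (illustrative values only).
d = crowding_distance([(0.0, 4.0), (1.0, 2.0), (2.0, 1.0), (4.0, 0.0)])
```

Here the two end points both receive inf, so which of them survives a truncation that must drop one is decided by tie-breaking rather than by the metric.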
…ence history

Visualization (SeqOptPlot): new convergence (per-generation hypervolume + spread + per-objective best, from the new SeqOpt.history_), 3-D pareto_front (optional z), and parallel_coordinates for many-objective fronts. Per-generation history is now tracked (spread + per-objective best, not only hypervolume) and exposed as SeqOpt.history_.

Objectives: a callable source now receives the variant SEQUENCE (fn(sequence) -> float) and is cached per distinct variant, so any external predictor (a scikit/torch model or a sequence-level tool / web API) can be optimized jointly with the model-on-features objectives; pure-callable multi-objective runs need no CPP model.

Two executed example notebooks (seqopt_convergence, seqopt_parallel_coordinates) demonstrate the views + the external-predictor recipe.

53 SeqOpt frontend tests (+19) and 459 in the broader gate green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
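The "cached per distinct variant" behavior for callable objectives might look roughly like the sketch below. The wrapper class and the toy predictor are hypothetical names invented for illustration; how SeqOpt caches internally is not shown in this PR.

```python
from typing import Callable, Dict


class CachedObjective:
    """Wrap fn(sequence) -> float so each distinct sequence is scored once."""

    def __init__(self, fn: Callable[[str], float]) -> None:
        self._fn = fn
        self._cache: Dict[str, float] = {}
        self.n_calls = 0  # number of real (uncached) evaluations

    def __call__(self, sequence: str) -> float:
        if sequence not in self._cache:
            self.n_calls += 1
            self._cache[sequence] = float(self._fn(sequence))
        return self._cache[sequence]


def hydrophobic_fraction(sequence: str) -> float:
    """Toy stand-in for an external predictor (model, tool, or web API)."""
    return sum(aa in "AILMFVW" for aa in sequence) / len(sequence)


objective = CachedObjective(hydrophobic_fraction)
variants = ["AILK", "AILK", "GGGG", "AILK"]
scores = [objective(s) for s in variants]
```

Caching matters most when the predictor is slow or rate-limited, since an evolutionary run tends to revisit the same variants across generations.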
Rewrite all six SeqOpt example notebooks around a real task — 'design a super
gamma-secretase substrate': load_features('DOM_GSEC') (150 CPP features) +
load_dataset('DOM_GSEC') + a simple RandomForest, take a non-substrate wild-type
and mutate its TMD to maximize predicted substrate probability with few mutations.
They demonstrate run (nsga2/greedy, impact/importance, varOr/ea_simple/operators,
constraints + Hall of Fame, external-predictor callable objective), eval
(hypervolume/spread/convergence), and all four SeqOptPlot views (pareto_front 2-D/
3-D, parallel_coordinates, convergence, hypervolume), with executed outputs.
Fix (found via the realistic reference): SeqOpt mode='impact' refit kept the FULL
df_seq_ref, so a reference from load_dataset (carrying jmd_n/tmd/jmd_c/label) NaN-
tripped check_df_seq on the appended variant row. Now keep only the position-based
columns; add a regression test with an extra-column reference.
460-test broad gate + docstrings clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…hero plots

Critical-assessment improvements:
- engine='fast' non-dominated sort now computes the dominance matrix in row-chunks (adaptive block), bounding the transient to O(block*n*m) vs O(n^2*m), ~2.6x leaner peak memory at n=3000; identical fronts (parity unchanged). Realistic pool sizes were never a problem; this makes pathologically large populations safe.
- run() keeps a cumulative non-dominated archive (DEAP ParetoFront analogue), merged into the final population so the returned rank=0 front is the best-ever set: no solution lost to per-generation crowding truncation.
- history_ now tracks per-objective best/mean/worst per generation.

Hero plots (the genre's standard views):
- SeqOptPlot.mutation_map: position x amino-acid substitution-enrichment heatmap across the front (the directed-evolution 'which mutations won' view).
- SeqOptPlot.convergence gains the classic GA best/mean/worst fitness band.

New executed notebook seqopt_mutation_map; tests for mutation_map + the band + archive.

465-test broad gate + docstrings clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
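The row-chunking idea can be sketched with NumPy as follows. This is a simplified illustration that only finds the first (rank-0) front, assumes minimization, and uses a fixed block size rather than the adaptive one described above; the function name is hypothetical and this is not the SeqOpt implementation.

```python
import numpy as np


def nondominated_mask(F: np.ndarray, block: int = 256) -> np.ndarray:
    """True where a row of F (n x m, minimization assumed) is dominated by no other row."""
    F = np.asarray(F, dtype=float)
    n = F.shape[0]
    dominated = np.zeros(n, dtype=bool)
    for start in range(0, n, block):
        chunk = F[start:start + block]  # (b, m) candidate dominators
        # Broadcasting (b, 1, m) against (1, n, m) allocates O(block * n * m)
        # booleans instead of the O(n^2 * m) needed for the full matrix.
        le = (chunk[:, None, :] <= F[None, :, :]).all(axis=2)
        lt = (chunk[:, None, :] < F[None, :, :]).any(axis=2)
        # Column j is dominated if any row in this chunk dominates it.
        dominated |= (le & lt).any(axis=0)
    return ~dominated


# Toy fitness matrix with one dominated row and one duplicate (illustrative values).
F = np.array([[1, 4], [2, 2], [4, 1], [3, 3], [2, 2]])
mask_small_block = nondominated_mask(F, block=2)
mask_large_block = nondominated_mask(F, block=256)
```

The block size changes only the peak memory of the intermediate arrays, not the result, which is the property the parity tests rely on.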
- SeqOptPlot.genealogy: mutational-lineage tree (wild-type -> variants by accumulated mutations, linked by mutation-set containment, colored by the first objective), the directed-evolution analogue of a genealogy tree, matplotlib-only (no networkx).
- SeqOpt class docstring now carries a rendered list-table mapping every run/eval method + parameter value to its DEAP function (selNSGA2/varAnd/varOr/eaMuPlusLambda/cxUniform/DeltaPenalty/...), with the aaanalysis-only rows called out.
- New executed seqopt_genealogy notebook + tests.

273-test gate + docstrings clean.
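Linking variants by mutation-set containment could be done along the lines of this sketch. It assumes equal-length sequences (substitutions only) and picks the largest proper subset as the parent; both the parent rule and the function names are assumptions made for illustration, not the documented behavior of SeqOptPlot.genealogy.

```python
from typing import Dict, FrozenSet, List


def mutations(wild_type: str, variant: str) -> FrozenSet[str]:
    """Substitutions relative to the wild-type, labeled like 'A1V' (1-based)."""
    if len(wild_type) != len(variant):
        raise ValueError("sketch assumes equal-length sequences")
    return frozenset(
        f"{w}{i}{v}"
        for i, (w, v) in enumerate(zip(wild_type, variant), start=1)
        if w != v
    )


def lineage_parents(wild_type: str, variants: List[str]) -> Dict[str, str]:
    """Map each variant to a parent whose mutation set it strictly contains."""
    muts = {v: mutations(wild_type, v) for v in variants if v != wild_type}
    parents: Dict[str, str] = {}
    for child, child_muts in muts.items():
        # Candidates carry a proper subset of the child's mutations.
        candidates = [v for v, m in muts.items() if m < child_muts]
        # Assumed rule: the closest ancestor is the largest such subset;
        # variants with no ancestor in the set hang off the wild-type.
        parents[child] = max(
            candidates, key=lambda v: (len(muts[v]), v), default=wild_type
        )
    return parents


# Toy sequences (illustrative only).
parents = lineage_parents("AAAA", ["VAAA", "VALA", "VALK", "AAAK"])
```

The resulting child-to-parent map is a tree rooted at the wild-type, which is enough to lay out with plain matplotlib.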
…onsistency)

pareto_front / parallel_coordinates / mutation_map / genealogy take a user-overridable cmap= (defaults unchanged), matching the CPPPlot / AAMutPlot / SeqMutPlot convention of colormap-as-parameter instead of a hardcoded name.
tutorial7_protein_design: an executed end-to-end case study — train a GSEC substrate classifier, design a 'super substrate' from a non-substrate, and read the result with every SeqOptPlot view (pareto_front 2-D/3-D, convergence, mutation_map, genealogy, parallel_coordinates) plus SHAP-guided impact mode. Wired into the Tutorials toctree under a new Protein Design section.
…) + refs

Draw the paradigm distinction clearly in the SeqOpt class docstring, the tutorial, and CONTEXT.md: SeqOpt does protein *engineering* (machine-learning-guided directed evolution of an existing sequence [Yang19]), explicitly NOT de novo protein design (generating new proteins). Introduce de novo design as the contrasting paradigm via the canonical structure-first pipeline RFdiffusion [Watson23] -> ProteinMPNN [Dauparas22] -> AlphaFold [Jumper21], reviewed in [DeNovoReview26]. Add all five references to references.rst; tutorial retitled 'Protein Engineering with SeqOpt' with the distinction + hyperlinked refs; Tutorials toctree section renamed.

Docstring citations resolve (0 defects); 104-test gate green.
…ce reviews

Read the two provided reviews and fixed the citations: ML-guided directed evolution is Wittmann, Johnston, Wu & Arnold (2021), Curr. Opin. Struct. Biol. (not the Yang19 I had guessed); the de novo design review is Yang et al. (2026), Nature 652:1139. Sharpened the distinction in the SeqOpt docstring + tutorial + CONTEXT.md using the reviews' own framing (de novo = build new proteins from the ground up; engineering = iterative mutation/selection of an existing protein, ML learns the fitness model).

Citations resolve; 64-test gate green.
…ed evolution

Add [Yang19] (Nature Methods 2019, the foundational ML-guided directed-evolution-for-protein-engineering review) alongside [Wittmann21] in the SeqOpt docstring, tutorial and CONTEXT.md. Citations resolve.
…rity

# Conflicts:
#	docs/source/index/release_notes.rst
…erators decision

Number the previously number-less parity ADR (one past the current master max 0044 = find-features protocol), set status Accepted, regenerate INDEX.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #271 +/- ##
==========================================
+ Coverage 96.10% 96.13% +0.02%
==========================================
Files 175 176 +1
Lines 16374 16733 +359
Branches 2796 2863 +67
==========================================
+ Hits 15737 16087 +350
+ Misses 369 363 -6
- Partials 268 283 +15
Summary
Completes the parity-first half of #261 (deferred from PR #267) and substantially extends
SeqOpt (pro). Builds on ADR-0043; recorded in ADR-0045.
Pure-Python EA operator set (DEAP-free runtime)
Beyond the NSGA-II core: varAnd/varOr variation; (μ+λ)/(μ,λ)/eaSimple survival; constraints (DeltaPenalty / ClosestValidPenalty); uniform/one-/two-point crossover; substitution/shift mutation; single-objective Hall of Fame; a cumulative Pareto archive (rank-0 = best-ever, none lost to crowding); hypervolume / spread / convergence metrics. Objectives accept any callable(sequence) -> float (external scikit/torch model or web API), cached per variant.
DEAP parity (dev/test-only oracle)
deap added to [dev] only. test_seqopt_deap_parity.py proves our sortNondominated/assignCrowdingDist/selNSGA2 match DEAP: identical rank (incl. ties), crowding values+ordering within atol, survivor profile. Phase-C comparison (.github/scripts/seqopt_deap_comparison.py): ours-fast is 3–7× faster than DEAP and dependency-free → ship ours. engine="exact"|"fast" give identical fronts; fast is memory-bounded (chunked, 2.6× leaner at n=3000).
Visualization (SeqOptPlot)
pareto_front (2-D/3-D), parallel_coordinates, convergence (best/mean/worst band), hypervolume, mutation_map (front substitution-enrichment heatmap), genealogy (mutational-lineage tree). cmap is a parameter throughout (package convention).
Docs / framing
The class docstring carries a DEAP-mapping table and clearly frames SeqOpt as protein engineering (ML-guided directed evolution, [Yang19]/[Wittmann21]) vs de novo design (RFdiffusion→ProteinMPNN→AlphaFold, [Yang26]). New: 8 per-method example notebooks (realistic GSEC "super-substrate" task) + tutorial7_protein_engineering.
Bugs fixed (found via realistic data)
mode="impact" kept the full df_seq_ref → NaN-tripped check_df_seq when the reference came from load_dataset; now position-cols only (+ regression test).
Verification
469-test broad gate green locally (SeqOpt suite, all meta-tests, docstrings, parity); merged current with master.
🤖 Generated with Claude Code